NSF PAR Search | NSF Public Access Repository

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

Dynamic-Width Speculative Beam Decoding for LLM Inference

https://doi.org/10.1609/aaai.v39i23.34690

Qin, Zongyue; He, Zifan; Prakriya, Neha; Cong, Jason; Sun, Yizhou (April 2025, Proceedings of the AAAI Conference on Artificial Intelligence)

Large language models (LLMs) based on transformer architecture have shown outstanding performance across numerous real-world tasks. However, the autoregressive nature of these models makes the inference process slow and costly. Speculative decoding has emerged as a promising solution, leveraging a smaller auxiliary model to draft future tokens, which are then validated simultaneously by the larger model, achieving a speed-up of 1-2x. Although speculative decoding matches the same distribution as multinomial sampling, multinomial sampling itself is prone to suboptimal outputs, where as beam sampling is widely recognized for producing higher-quality results by maintaining multiple candidate sequences at each step.This paper explores the novel integration of speculative decoding with beam sampling. However, there are four key challenges: (1) how to generate multiple sequences from the larger model's distribution given drafts sequences from the small model; (2) how to dynamically optimize the number of beams to balance efficiency and accuracy; (3) how to efficiently verify the multiple drafts in parallel; and (4) how to address the extra memory costs inherent in beam sampling.To address these challenges, we propose dynamic-width speculative beam decoding (DSBD). Specifically, we first introduce a novel draft and verification scheme that generates multiple sequences following the large model's distribution based on beam sampling trajectories from the small model. Then, we introduce an adaptive mechanism to dynamically tune the number of beams based on the context, optimizing efficiency and effectiveness. Besides, we extend tree-based parallel verification to handle multiple trees simultaneously, accelerating the verification process. Finally, we illustrate a simple modification to our algorithm to mitigate the memory overhead of beam sampling.Experimental results show that our approach achieves a 1.5-1.9x speed-up and1.8-2.5x lower energy consumption compared to beam sampling, with no loss in downstream performance. Moreover, it can produce significantly higher-quality outputs than speculative decoding, while maintaining similar time, memory, and energy costs. In summary, our method offers a more efficient and effective inference process for LLMs.
more » « less
Free, publicly-accessible full text available April 11, 2026
Optimized Multi-Token Joint Decoding With Auxiliary Model for LLM Inference

Qin, Zongyue; Hu, Ziniu; He, Zifan; Prakriya, Neha; Cong, Jason; Sun, Yizhou (April 2025, The Thirteenth International Conference on Learning Representations (ICLR 2025))

Free, publicly-accessible full text available April 24, 2026
Hierarchical Mixture of Experts: Generalizable Learning for High-Level Synthesis

https://doi.org/10.1609/aaai.v39i17.34033

Li, Weikai; Wang, Ding; Ding, Zijian; Sohrabizadeh, Atefeh; Qin, Zongyue; Cong, Jason; Sun, Yizhou (April 2025, Proceedings of the AAAI Conference on Artificial Intelligence)

High-level synthesis (HLS) is a widely used tool in designing Field Programmable Gate Array (FPGA). HLS enables FPGA design with software programming languages by compiling the source code into an FPGA circuit. The source code includes a program (called ``kernel'') and several pragmas that instruct hardware synthesis, such as parallelization, pipeline, etc. While it is relatively easy for software developers to design the program, it heavily relies on hardware knowledge to design the pragmas, posing a big challenge for software developers. Recently, different machine learning algorithms, such as GNNs, have been proposed to automate the pragma design via performance prediction. However, when applying the trained model on new kernels, the significant domain shift often leads to unsatisfactory performance. We propose a more domain-generalizable model structure: a two-level hierarchical Mixture of Experts (MoE), that can be flexibly adapted to any GNN model. Different expert networks can learn to deal with different regions in the representation space, and they can utilize similar patterns between the old kernels and new kernels. In the low-level MoE, we apply MoE on three natural granularities of a program: node, basic block, and graph. The high-level MoE learns to aggregate the three granularities for the final decision. To stably train the hierarchical MoE, we further propose a two-stage training method. Extensive experiments verify the effectiveness of the hierarchical MoE.
more » « less
Free, publicly-accessible full text available April 11, 2026
Cross-Modality Program Representation Learning for Electronic Design Automation with High-Level Synthesis

https://doi.org/10.1145/3670474.3685952

Qin, Zongyue; Bai, Yunsheng; Sohrabizadeh, Atefeh; Ding, Zijian; Hu, Ziniu; Sun, Yizhou; Cong, Jason (September 2024, ACM)

In recent years, domain-specific accelerators (DSAs) have gained popularity for applications such as deep learning and autonomous driving. To facilitate DSA designs, programmers use high-level synthesis (HLS) to compile a high-level description written in C/C++ into a design with low-level hardware description languages that eventually synthesize DSAs on circuits. However, creating a highquality HLS design still demands significant domain knowledge, particularly in microarchitecture decisions expressed as pragmas. Thus, it is desirable to automate such decisions with the help of machine learning for predicting the quality of HLS designs, requiring a deeper understanding of the program that consists of original code and pragmas. Naturally, these programs can be considered as sequence data. In addition, these programs can be compiled and converted into a control data flow graph (CDFG). But existing works either fail to leverage both modalities or combine the two in shallow or coarse ways. We propose ProgSG, a model that allows interaction between the source code sequence modality and the graph modality in a deep and fine-grained way. To alleviate the scarcity of labeled designs, a pre-training method is proposed based on a suite of compiler’s data flow analysis tasks. Experimental results show that ProgSG reduces the RMSE of design performance predictions by up to 22%, and identifies designs with an average of 1.10× and 1.26× (up to 8.17× and 13.31×) performance improvement in design space exploration (DSE) task compared to HARP and AutoDSE, respectively.
more » « less
Full Text Available
Towards a Comprehensive Benchmark for High-Level Synthesis Targeted to FPGAs

Bai, Yunsheng; Sohrabizadeh, Atefeh; Qin, Zongyue; Hu, Ziniu; Sun, Yizhou; Cong, Jason (December 2023, 37th Conference on Neural Information Processing Systems (NeurIPS 2023))

High-level synthesis (HLS) aims to raise the abstraction layer in hardware design, enabling the design of domain-specific accelerators (DSAs) targeted for field- programmable gate arrays (FPGAs) using C/C++ instead of hardware description languages (HDLs). Compiler directives in the form of pragmas play a crucial role in modifying the microarchitecture within the HLS framework. However, the number of possible microarchitectures grows exponentially with the number of pragmas. Moreover, the evaluation of each candidate design using the HLS tool consumes significant time, ranging from minutes to hours, leading to a slow optimization process. To accelerate this process, machine learning models have been used to predict design quality in milliseconds. However, existing open-source datasets for training such models are limited in terms of design complexity and available optimizations. In this paper, we present HLSYN, a new benchmark that addresses these limitations. It contains more complex programs with a wider range of optimization pragmas, making it a comprehensive dataset for training and evaluating design quality prediction models. The HLSYN benchmark consists of 42 unique programs/kernels, each of which has many different pragma configurations, resulting in over 42,000 labeled designs. We conduct an extensive comparison of state-of-the-art baselines to assess their effectiveness in predicting design quality. As an ongoing project, we anticipate expanding the HLSYN benchmark in terms of both quantity and variety of programs to further support the development of this field.
more » « less
Full Text Available
Towards a Comprehensive Benchmark for High-Level Synthesis Targeted to FPGAs

Bai, Yunsheng; Sohrabizadeh, Atefeh; Qin, Zongyue; Hu, Ziniu; Sun, Yizhou; Cong, Jason (December 2023, 37th Conference on Neural Information Processing Systems (NeurIPS 2023))

Full Text Available
Efficient Task Transfer for HLS DSE

https://doi.org/10.1145/3676536.3676723

Ding, Zijian; Sohrabizadeh, Atefeh; Li, Weikai; Qin, Zongyue; Sun, Yizhou; Cong, Jason (October 2024, ACM)

There have been several recent works proposed to utilize model-based optimization methods to improve the productivity of using high-level synthesis (HLS) to design domain-specific architectures. They would replace the time-consuming performance estimation or simulation of design with a proxy model, and automatically insert pragmas to guide hardware optimizations. In this work, we address the challenges associated with high-level synthesis (HLS) design space exploration (DSE) through the evolving landscape of HLS tools. As these tools develop, the quality of results (QoR) from synthesis can vary significantly, complicating the maintenance of optimal design strategies across different toolchains. We introduce Active-CEM, a task transfer learning scheme that leverages a model-based explorer designed to adapt efficiently to changes in toolchains. This approach optimizes sample efficiency by identifying high-quality design configurations under a new toolchain without requiring extensive re-evaluation. We further refine our methodology by incorporating toolchain-invariant modeling. This allows us to predict QoR changes more accurately despite shifts in the black-box implementation of the toolchains. Experiment results on the HLSyn benchmark transitioning to new toolchain show an average performance improvement of 2.38× compared to AutoDSE and a 1.2× improvement over HARP, while also increasing the sample efficiency by 5.75×, and reducing the runtime by 2.7×.
more » « less
Full Text Available
GHashing: Semantic Graph Hashing for Approximate Similarity Search in Graph Databases

https://doi.org/10.1145/3394486.3403257

Qin, Zongyue; Bai, Yunsheng; Sun, Yizhou (January 2020, Proc. of 2020 ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD’20))

Full Text Available

Search for: All records